An efficient DNA sequence searching method using position specific weighting scheme
Abstract
Exact match queries, wildcard match queries, and k-mismatch queries are widely used in molecular biology applications, including searching ESTs (Expressed Sequence Tags) and DNA transcription factors. In this paper, we suggest an efficient indexing and processing mechanism for such queries. Our indexing method places a sliding window at every possible location of a DNA sequence and extracts its signature by considering the occurrence frequency of each nucleotide. It then stores the set of signatures in a multi-dimensional index such as the R*-tree. By assigning a weight to each position of a window, it also prevents signatures from being concentrated around a few spots in the indexing space. Our query processing method converts a query sequence into a multi-dimensional rectangle and searches the index for the signatures that overlap the rectangle. Experiments with real biological data sets have revealed that the proposed approach is at least 4.4 times, 2.1 times, and several orders of magnitude faster than the previous approach in performing exact match, wildcard match, and k-mismatch queries, respectively.
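The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the actual weighting scheme, so the integer weights `[1, 2, 3, 4]` are an assumption, as are the wildcard symbol `N`, the function names, and the requirement that the query length equal the window length. A linear scan over the signatures stands in for the R*-tree, and k-mismatch queries are left out.

```python
from typing import List, Tuple

NUCLEOTIDES = "ACGT"  # assumes the sequence contains only these four symbols


def signature(window: str, weights: List[int]) -> Tuple[int, ...]:
    """Sum the position weights separately for each nucleotide in the window."""
    totals = {n: 0 for n in NUCLEOTIDES}
    for base, weight in zip(window, weights):
        totals[base] += weight
    return tuple(totals[n] for n in NUCLEOTIDES)


def build_index(seq: str, weights: List[int]) -> List[Tuple[Tuple[int, ...], int]]:
    """Slide a window over every position and record (signature, offset)."""
    size = len(weights)
    return [(signature(seq[i:i + size], weights), i)
            for i in range(len(seq) - size + 1)]


def query_rectangle(query: str, weights: List[int], wildcard: str = "N"):
    """Turn a query into per-nucleotide lower and upper bounds.

    A fixed symbol adds its weight to both bounds of its own dimension.
    A wildcard could be any nucleotide, so it only raises every upper bound.
    """
    low = {n: 0 for n in NUCLEOTIDES}
    high = {n: 0 for n in NUCLEOTIDES}
    for base, weight in zip(query, weights):
        if base == wildcard:
            for n in NUCLEOTIDES:
                high[n] += weight
        else:
            low[base] += weight
            high[base] += weight
    return (tuple(low[n] for n in NUCLEOTIDES),
            tuple(high[n] for n in NUCLEOTIDES))


def search(seq: str, index, query: str, weights: List[int],
           wildcard: str = "N") -> List[int]:
    """Filter by rectangle containment, then verify each candidate window."""
    low, high = query_rectangle(query, weights, wildcard)
    hits = []
    for sig, pos in index:  # linear scan: stand-in for an R*-tree range query
        if all(lo <= s <= hi for lo, s, hi in zip(low, sig, high)):
            window = seq[pos:pos + len(query)]
            if all(q == wildcard or q == c for q, c in zip(query, window)):
                hits.append(pos)
    return hits


weights = [1, 2, 3, 4]  # hypothetical position weights
seq = "ACGTACGTTACG"
index = build_index(seq, weights)
print(search(seq, index, "ACGT", weights))  # exact match query
print(search(seq, index, "TNCG", weights))  # wildcard match query
```

With uniform weights, `ACGT` and `TGCA` would share the same signature because they contain the same nucleotide counts. With position weights their signatures differ, which is the spreading effect the abstract attributes to the weighting scheme. The verification step is still needed because distinct windows can fall inside the same rectangle.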
Similar resources
Maximum Entropy Weighting of Aligned Sequences of Proteins or DNA
In a family of proteins or other biological sequences like DNA the various subfamilies are often very unevenly represented. For this reason a scheme for assigning weights to each sequence can greatly improve performance at tasks such as database searching with profiles or other consensus models based on multiple alignments. A new weighting scheme for this type of database search is proposed. In...
An improved Gibbs sampling method for motif discovery via sequence weighting.
The discovery of motifs in DNA sequences remains a fundamental and challenging problem in computational molecular biology and regulatory genomics, although a large number of computational methods have been proposed in the past decade. Among these methods, the Gibbs sampling strategy has shown great promise and is routinely used for finding regulatory motif elements in the promoter regions of co...
Lot Streaming in No-wait Multi Product Flowshop Considering Sequence Dependent Setup Times and Position Based Learning Factors
This paper considers a no-wait multi product flowshop scheduling problem with sequence dependent setup times. Lot streaming divides the lots of products into portions called sublots in order to reduce the lead times and work-in-process, and increase the machine utilization rates. The objective is to minimize the makespan. To clarify the system, a mathematical model of the problem is presented. Sin...
Image Encryption by Using Combination of DNA Sequence and Lattice Map
In recent years, the advancement of digital technology has led to an increase in data transmission on the Internet. Security of images is one of the biggest concerns of many researchers. Therefore, numerous algorithms have been presented for image encryption. An efficient encryption algorithm should have high security and low search time along with high complexity. DNA encryption is one of the fa...
Maximum Entropy Weighting of Aligned Sequences of Proteins or
In a family of proteins or other biological sequences like DNA the various subfamilies are often very unevenly represented. For this reason a scheme for assigning weights to each sequence can greatly improve performance at tasks such as database searching with profiles or other consensus models based on multiple alignments. A new weighting scheme for this type of database search is proposed. In ...
Journal title:
- J. Information Science
Volume 32
Publication date: 2006